Semi-supervised Bibliographic Element Segmentation with Latent Permutations
Internal identifier: 000440 (Main/Exploration); previous: 000439; next: 000441
Authors: Tomonari Masada [Japan]; Atsuhiro Takasu [Japan]; Yuichiro Shibata [Japan]; Kiyoshi Oguri [Japan]
Source:
- Lecture Notes in Computer Science [0302-9743]; 2011.
Abstract
This paper proposes a semi-supervised method for bibliographic element segmentation. Our input data is a large-scale set of bibliographic references, each given as an unsegmented sequence of word tokens. Our problem is to segment each reference into bibliographic elements, e.g. authors, title, journal, and pages. We solve this problem with an LDA-like topic model by assigning each word token to a topic so that the word tokens assigned to the same topic refer to the same bibliographic element. Topic assignments should satisfy a contiguity constraint, i.e., the constraint that the word tokens assigned to the same topic should be contiguous. To meet this constraint, we proposed a topic model in our preceding work [8], based on the topic model devised by Chen et al. [3]. Our model extends LDA and realizes unsupervised topic assignments that satisfy the contiguity constraint. The main contribution of this paper is a semi-supervised learning method for that model. We assume that at most one third of the word tokens are already labeled. In addition, we assume that a few percent of the labels may be incorrect. The experiment showed that our semi-supervised learning improved on the unsupervised learning by a large margin and achieved a segmentation accuracy of over 90%.
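The contiguity constraint described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model or inference procedure: it only checks whether a given topic assignment is contiguous and groups tokens into segments. The function names `is_contiguous` and `segments`, the integer topic ids, and the sample tokens are all assumptions made for illustration.

```python
from itertools import groupby


def is_contiguous(topics):
    """Return True if each topic id occupies one unbroken run of positions."""
    # Collapse consecutive repeats; a topic that reappears later breaks contiguity.
    runs = [topic for topic, _ in groupby(topics)]
    return len(runs) == len(set(runs))


def segments(tokens, topics):
    """Group tokens into (topic, token_list) pairs, one pair per run of equal topics."""
    if len(tokens) != len(topics):
        raise ValueError("tokens and topics must have the same length")
    result = []
    start = 0
    for topic, run in groupby(topics):
        length = len(list(run))
        result.append((topic, tokens[start:start + length]))
        start += length
    return result


if __name__ == "__main__":
    # Hypothetical reference tokens and topic ids (0 = authors, 1 = title, 2 = series, 3 = year).
    tokens = ["T.", "Masada", "Semi-supervised", "Segmentation", "LNCS", "2011"]
    topics = [0, 0, 1, 1, 2, 3]
    print(is_contiguous(topics))
    for topic, words in segments(tokens, topics):
        print(topic, words)
```

An assignment such as `[0, 1, 0]` fails the check because topic 0 appears in two separate runs, which is exactly the situation the constraint rules out.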
DOI: 10.1007/978-3-642-24826-9_11
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000B19
- to stream Istex, to step Curation: 000B06
- to stream Istex, to step Checkpoint: 000096
- to stream Main, to step Merge: 000445
- to stream Main, to step Curation: 000440
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Semi-supervised Bibliographic Element Segmentation with Latent Permutations</title>
<author><name sortKey="Masada, Tomonari" sort="Masada, Tomonari" uniqKey="Masada T" first="Tomonari" last="Masada">Tomonari Masada</name>
</author>
<author><name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
</author>
<author><name sortKey="Shibata, Yuichiro" sort="Shibata, Yuichiro" uniqKey="Shibata Y" first="Yuichiro" last="Shibata">Yuichiro Shibata</name>
</author>
<author><name sortKey="Oguri, Kiyoshi" sort="Oguri, Kiyoshi" uniqKey="Oguri K" first="Kiyoshi" last="Oguri">Kiyoshi Oguri</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:FDBFD7EB8D5D22EB3CE0996E9EA49BDEA3B33DFC</idno>
<date when="2011" year="2011">2011</date>
<idno type="doi">10.1007/978-3-642-24826-9_11</idno>
<idno type="url">https://api.istex.fr/document/FDBFD7EB8D5D22EB3CE0996E9EA49BDEA3B33DFC/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000B19</idno>
<idno type="wicri:Area/Istex/Curation">000B06</idno>
<idno type="wicri:Area/Istex/Checkpoint">000096</idno>
<idno type="wicri:doubleKey">0302-9743:2011:Masada T:semi:supervised:bibliographic</idno>
<idno type="wicri:Area/Main/Merge">000445</idno>
<idno type="wicri:Area/Main/Curation">000440</idno>
<idno type="wicri:Area/Main/Exploration">000440</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Semi-supervised Bibliographic Element Segmentation with Latent Permutations</title>
<author><name sortKey="Masada, Tomonari" sort="Masada, Tomonari" uniqKey="Masada T" first="Tomonari" last="Masada">Tomonari Masada</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki</wicri:regionArea>
<wicri:noRegion>Nagasaki</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
<affiliation wicri:level="3"><country xml:lang="fr">Japon</country>
<wicri:regionArea>National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo</wicri:regionArea>
<placeName><settlement type="city">Tokyo</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Shibata, Yuichiro" sort="Shibata, Yuichiro" uniqKey="Shibata Y" first="Yuichiro" last="Shibata">Yuichiro Shibata</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki</wicri:regionArea>
<wicri:noRegion>Nagasaki</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Oguri, Kiyoshi" sort="Oguri, Kiyoshi" uniqKey="Oguri K" first="Kiyoshi" last="Oguri">Kiyoshi Oguri</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki</wicri:regionArea>
<wicri:noRegion>Nagasaki</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2011</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">FDBFD7EB8D5D22EB3CE0996E9EA49BDEA3B33DFC</idno>
<idno type="DOI">10.1007/978-3-642-24826-9_11</idno>
<idno type="ChapterID">11</idno>
<idno type="ChapterID">Chap11</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: This paper proposes a semi-supervised bibliographic element segmentation. Our input data is a large scale set of bibliographic references each given as an unsegmented sequence of word tokens. Our problem is to segment each reference into bibliographic elements, e.g. authors, title, journal, pages, etc. We solve this problem with an LDA-like topic model by assigning each word token to a topic so that the word tokens assigned to the same topic refer to the same bibliographic element. Topic assignments should satisfy contiguity constraint, i.e., the constraint that the word tokens assigned to the same topic should be contiguous. Therefore, we proposed a topic model in our preceding work [8] based on the topic model devised by Chen et al. [3]. Our model extends LDA and realizes unsupervised topic assignments satisfying contiguity constraint. The main contribution of this paper is the proposal of a semi-supervised learning for our proposed model. We assume that at most one third of word tokens are already labeled. In addition, we assume that a few percent of the labels may be incorrect. The experiment showed that our semi-supervised learning improved the unsupervised learning by a large margin and achieved an over 90% segmentation accuracy.</div>
</front>
</TEI>
<affiliations><list><country><li>Japon</li>
</country>
<settlement><li>Tokyo</li>
</settlement>
</list>
<tree><country name="Japon"><noRegion><name sortKey="Masada, Tomonari" sort="Masada, Tomonari" uniqKey="Masada T" first="Tomonari" last="Masada">Tomonari Masada</name>
</noRegion>
<name sortKey="Masada, Tomonari" sort="Masada, Tomonari" uniqKey="Masada T" first="Tomonari" last="Masada">Tomonari Masada</name>
<name sortKey="Oguri, Kiyoshi" sort="Oguri, Kiyoshi" uniqKey="Oguri K" first="Kiyoshi" last="Oguri">Kiyoshi Oguri</name>
<name sortKey="Oguri, Kiyoshi" sort="Oguri, Kiyoshi" uniqKey="Oguri K" first="Kiyoshi" last="Oguri">Kiyoshi Oguri</name>
<name sortKey="Shibata, Yuichiro" sort="Shibata, Yuichiro" uniqKey="Shibata Y" first="Yuichiro" last="Shibata">Yuichiro Shibata</name>
<name sortKey="Shibata, Yuichiro" sort="Shibata, Yuichiro" uniqKey="Shibata Y" first="Yuichiro" last="Shibata">Yuichiro Shibata</name>
<name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
<name sortKey="Takasu, Atsuhiro" sort="Takasu, Atsuhiro" uniqKey="Takasu A" first="Atsuhiro" last="Takasu">Atsuhiro Takasu</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000440 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000440 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:FDBFD7EB8D5D22EB3CE0996E9EA49BDEA3B33DFC |texte= Semi-supervised Bibliographic Element Segmentation with Latent Permutations }}
This area was generated with Dilib version V0.6.32.